{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Lesson 05 - Pandas part 2\n", "\n", "Welcome to lesson 6! In this lesson we will introduce pandas as our main way of storing data. NumPy will be useful when we have a uniform multidimensional data set, but for now pandas will be our default.\n", "\n", "Again, an exhaustive run through of pandas is too much for this class - see the book - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) or the [official pandas website](http://pandas.pydata.org/pandas-docs/version/0.17.1/).\n", "\n", "If you are familiar with [R](http://pandas.pydata.org/pandas-docs/version/0.17.1/comparison_with_r.html), [SAS](http://pandas.pydata.org/pandas-docs/version/0.17.1/comparison_with_sas.html), and/or [SQL](http://pandas.pydata.org/pandas-docs/version/0.17.1/comparison_with_sql.html), click on the links to lead you to the intro to pandas for users of each language.\n", "\n", "Please download todays notebook [here](/pythoncourse/assests/notebooks/applied/lesson 05 applied.ipynb).\n", "\n", "### Data Import\n", "\n", "Importing data is the most important first step to get our data in. Today we will cover read_csv, before we finish the course we will talk about how to connect to your netezza (and other SQL) databases" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas import DataFrame, Series\n", "import pandas as pd\n", "import io\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a ton of ways of reading data into and writing data out of pandas. See the [dataIO page](http://pandas.pydata.org/pandas-docs/stable/io.html) for more details." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "date,A,B,C\n", "20090101,a,1,2\n", "20090102,b,3,4\n", "20090103,c,4,5\n", "\n" ] } ], "source": [ "#using a string as example\n", "#we could refer to file names if we had the file saved\n", "data = '''\n", "date,A,B,C\n", "20090101,a,1,2\n", "20090102,b,3,4\n", "20090103,c,4,5\n", "'''\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateABC
020090101a12
120090102b34
220090103c45
\n", "
" ], "text/plain": [ " date A B C\n", "0 20090101 a 1 2\n", "1 20090102 b 3 4\n", "2 20090103 c 4 5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#by default, the index is arange(nrows)\n", "pd.read_csv(io.StringIO(data))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
date
20090101a12
20090102b34
20090103c45
\n", "
" ], "text/plain": [ " A B C\n", "date \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can specify the index:\n", "pd.read_csv(io.StringIO(data), index_col=0)\n", "#also index_col='date'" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BC
dateA
20090101a12
20090102b34
20090103c45
\n", "
" ], "text/plain": [ " B C\n", "date A \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can also use nested indices:\n", "pd.read_csv(io.StringIO(data), index_col=['date','A'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
date
20090101a12
20090102b34
20090103c45
\n", "
" ], "text/plain": [ " A B C\n", "date \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can specify data type (it will speed things up, or avoid conversion)\n", "pd.read_csv(io.StringIO(data), index_col=['date'], dtype={'A' : str, 'B':np.int32, 'C':np.float64})" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foobarbaz
20090101a12
20090102b34
20090103c45
\n", "
" ], "text/plain": [ " foo bar baz\n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can throw out names and use our own\n", "pd.read_csv(io.StringIO(data), index_col=[0],\n", " dtype={'A' : str, 'B':np.int32, 'C':np.float64},\n", " names=[\"foo\", 'bar', \"baz\"],\n", " header = 0)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foobaz
0a2
1b4
2c5
\n", "
" ], "text/plain": [ " foo baz\n", "0 a 2\n", "1 b 4\n", "2 c 5" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#filter out some unneeded columns:\n", "pd.read_csv(io.StringIO(data),\n", " names=['date', 'foo', 'bar', \"baz\"],\n", " header = 0,\n", " usecols = ['foo', 'baz'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#dates! more to come\n", "dat = pd.read_csv(io.StringIO(data),\n", " parse_dates = True,\n", " index_col = [0]\n", " )\n", "dat.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrames\n", "\n", "DataFrames are similar to a dict of a series - technically they are a 2d series with some linking between levels.\n", "\n", "Columns are arrays (must be one data type), and rows are similar to dicts.\n", "\n", "However, the row/column mapping is not as strictly enforced as R." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TripTypeVisitNumberWeekdayUpcScanCountDepartmentDescriptionFinelineNumber
1307Friday605388159801SHOES8931
2307Friday74108110991PERSONAL CARE4504
3268Friday22384035102PAINT AND ACCESSORIES3565
4268Friday20066137442PAINT AND ACCESSORIES1017
5268Friday20066187832PAINT AND ACCESSORIES1017
6268Friday20066137431PAINT AND ACCESSORIES1017
7268Friday70048027371PAINT AND ACCESSORIES2802
8268Friday22384953181PAINT AND ACCESSORIES4501
9268Friday2238400200-1PAINT AND ACCESSORIES3565
10268Friday52000102391DSD GROCERY4606
11268Friday886793005012PAINT AND ACCESSORIES3504
12268Friday220060000001MEAT - FRESH & FROZEN6009
13268Friday22367604521PAINT AND ACCESSORIES7
14268Friday88679300501-1PAINT AND ACCESSORIES3504
15268Friday22384002002PAINT AND ACCESSORIES3565
16268Friday30192942031PAINT AND ACCESSORIES2801
17268Friday724504088401PAINT AND ACCESSORIES1028
18268Friday255415000002DAIRY1305
19268Friday23100107761PETS AND SUPPLIES3300
\n", "
" ], "text/plain": [ " TripType VisitNumber Weekday Upc ScanCount \\\n", "1 30 7 Friday 60538815980 1 \n", "2 30 7 Friday 7410811099 1 \n", "3 26 8 Friday 2238403510 2 \n", "4 26 8 Friday 2006613744 2 \n", "5 26 8 Friday 2006618783 2 \n", "6 26 8 Friday 2006613743 1 \n", "7 26 8 Friday 7004802737 1 \n", "8 26 8 Friday 2238495318 1 \n", "9 26 8 Friday 2238400200 -1 \n", "10 26 8 Friday 5200010239 1 \n", "11 26 8 Friday 88679300501 2 \n", "12 26 8 Friday 22006000000 1 \n", "13 26 8 Friday 2236760452 1 \n", "14 26 8 Friday 88679300501 -1 \n", "15 26 8 Friday 2238400200 2 \n", "16 26 8 Friday 3019294203 1 \n", "17 26 8 Friday 72450408840 1 \n", "18 26 8 Friday 25541500000 2 \n", "19 26 8 Friday 2310010776 1 \n", "\n", " DepartmentDescription FinelineNumber \n", "1 SHOES 8931 \n", "2 PERSONAL CARE 4504 \n", "3 PAINT AND ACCESSORIES 3565 \n", "4 PAINT AND ACCESSORIES 1017 \n", "5 PAINT AND ACCESSORIES 1017 \n", "6 PAINT AND ACCESSORIES 1017 \n", "7 PAINT AND ACCESSORIES 2802 \n", "8 PAINT AND ACCESSORIES 4501 \n", "9 PAINT AND ACCESSORIES 3565 \n", "10 DSD GROCERY 4606 \n", "11 PAINT AND ACCESSORIES 3504 \n", "12 MEAT - FRESH & FROZEN 6009 \n", "13 PAINT AND ACCESSORIES 7 \n", "14 PAINT AND ACCESSORIES 3504 \n", "15 PAINT AND ACCESSORIES 3565 \n", "16 PAINT AND ACCESSORIES 2801 \n", "17 PAINT AND ACCESSORIES 1028 \n", "18 DAIRY 1305 \n", "19 PETS AND SUPPLIES 3300 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dat = pd.read_csv(\"http://jeremy.kiwi.nz/pythoncourse/assets/tests/r&d/test1data.csv\")[1:20]\n", "dat" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# get the column names\n", "dat.columns\n", "#get the first five rows\n", "dat.head()\n", "#pick out specific columns\n", "DataFrame(dat,columns=['TripType','VisitNumber'])\n", "#same as\n", "dat[['TripType','VisitNumber']]\n", "#get one specific column\n", "dat.TripType\n", "#get one specific 
column\n", "dat['TripType']\n", "#if we want rows, we need loc, iloc or ix\n", "#loc works on labels in the index:\n", "dat.loc[1]\n", "#iloc works on integer positions:\n", "dat.iloc[1]\n", "#ix tries to work like loc, but falls back to iloc if the label is not found\n", "#(note: ix is deprecated in newer pandas versions - prefer loc or iloc):\n", "dat.ix[1]\n", "#add a new column\n", "dat['foo']=\"spam\"\n", "#using other columns:\n", "dat['foo'] = dat['VisitNumber'] + dat['ScanCount']\n", "#add a new column with specific values\n", "dat['foo']=Series(['spam', 'more spam'],index=[4,10])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#delete a column\n", "del dat['foo']\n", "#'http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html'" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TripTypeVisitNumberWeekdayUpcScanCountDepartmentDescriptionFinelineNumber
0NaNNaNNaNNaNNaNNaNNaN
1307Friday605388159801SHOES8931
2307Friday74108110991PERSONAL CARE4504
3268Friday22384035102PAINT AND ACCESSORIES3565
4268Friday20066137442PAINT AND ACCESSORIES1017
\n", "
" ], "text/plain": [ " TripType VisitNumber Weekday Upc ScanCount \\\n", "0 NaN NaN NaN NaN NaN \n", "1 30 7 Friday 60538815980 1 \n", "2 30 7 Friday 7410811099 1 \n", "3 26 8 Friday 2238403510 2 \n", "4 26 8 Friday 2006613744 2 \n", "\n", " DepartmentDescription FinelineNumber \n", "0 NaN NaN \n", "1 SHOES 8931 \n", "2 PERSONAL CARE 4504 \n", "3 PAINT AND ACCESSORIES 3565 \n", "4 PAINT AND ACCESSORIES 1017 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#recall, indexes are immutable?\n", "#how to reindex?\n", "dat = dat.reindex(np.arange(5))\n", "dat" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TripTypeVisitNumberWeekdayUpcScanCountDepartmentDescriptionFinelineNumber
0NaNNaNNaNNaNNaNNaNNaN
1307Friday605388159801SHOES8931
2307Friday74108110991PERSONAL CARE4504
3268Friday22384035102PAINT AND ACCESSORIES3565
4268Friday20066137442PAINT AND ACCESSORIES1017
5268Friday20066137442PAINT AND ACCESSORIES1017
6268Friday20066137442PAINT AND ACCESSORIES1017
7268Friday20066137442PAINT AND ACCESSORIES1017
8268Friday20066137442PAINT AND ACCESSORIES1017
9268Friday20066137442PAINT AND ACCESSORIES1017
\n", "
" ], "text/plain": [ " TripType VisitNumber Weekday Upc ScanCount \\\n", "0 NaN NaN NaN NaN NaN \n", "1 30 7 Friday 60538815980 1 \n", "2 30 7 Friday 7410811099 1 \n", "3 26 8 Friday 2238403510 2 \n", "4 26 8 Friday 2006613744 2 \n", "5 26 8 Friday 2006613744 2 \n", "6 26 8 Friday 2006613744 2 \n", "7 26 8 Friday 2006613744 2 \n", "8 26 8 Friday 2006613744 2 \n", "9 26 8 Friday 2006613744 2 \n", "\n", " DepartmentDescription FinelineNumber \n", "0 NaN NaN \n", "1 SHOES 8931 \n", "2 PERSONAL CARE 4504 \n", "3 PAINT AND ACCESSORIES 3565 \n", "4 PAINT AND ACCESSORIES 1017 \n", "5 PAINT AND ACCESSORIES 1017 \n", "6 PAINT AND ACCESSORIES 1017 \n", "7 PAINT AND ACCESSORIES 1017 \n", "8 PAINT AND ACCESSORIES 1017 \n", "9 PAINT AND ACCESSORIES 1017 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dat.reindex(np.arange(7),fill_value=0)\n", "dat.reindex(np.arange(10),method='ffill')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TripTypeVisitNumberWeekdayUpcScanCountDepartmentDescriptionFinelineNumber
0NaNNaNNaNNaNNaNNaNNaN
2307Friday74108110991PERSONAL CARE4504
3268Friday22384035102PAINT AND ACCESSORIES3565
4268Friday20066137442PAINT AND ACCESSORIES1017
\n", "
" ], "text/plain": [ " TripType VisitNumber Weekday Upc ScanCount \\\n", "0 NaN NaN NaN NaN NaN \n", "2 30 7 Friday 7410811099 1 \n", "3 26 8 Friday 2238403510 2 \n", "4 26 8 Friday 2006613744 2 \n", "\n", " DepartmentDescription FinelineNumber \n", "0 NaN NaN \n", "2 PERSONAL CARE 4504 \n", "3 PAINT AND ACCESSORIES 3565 \n", "4 PAINT AND ACCESSORIES 1017 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dat.drop(1)\n", "#dat.drop('foo', axis = 1)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TripTypeVisitNumberWeekdayUpcScanCountDepartmentDescriptionFinelineNumber
3268Friday22384035102PAINT AND ACCESSORIES3565
4268Friday20066137442PAINT AND ACCESSORIES1017
\n", "
" ], "text/plain": [ " TripType VisitNumber Weekday Upc ScanCount \\\n", "3 26 8 Friday 2238403510 2 \n", "4 26 8 Friday 2006613744 2 \n", "\n", " DepartmentDescription FinelineNumber \n", "3 PAINT AND ACCESSORIES 3565 \n", "4 PAINT AND ACCESSORIES 1017 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#getting data\n", "dat[['TripType','Upc']]\n", "dat['ScanCount']>1\n", "dat[dat['ScanCount']>1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick Review\n", "\n", "I mentioned in the previous lecture we can use all our base and NumPy methods on pandas DataFrames: Here is a quick review taken from the SQL lesson:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips = pd.read_csv('https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv')\n", "tips.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL select:\n", "```\n", "SELECT total_bill, tip, smoker, time\n", "FROM tips\n", "LIMIT 5;\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsmokertime
016.991.01NoDinner
110.341.66NoDinner
221.013.50NoDinner
323.683.31NoDinner
424.593.61NoDinner
\n", "
" ], "text/plain": [ " total_bill tip smoker time\n", "0 16.99 1.01 No Dinner\n", "1 10.34 1.66 No Dinner\n", "2 21.01 3.50 No Dinner\n", "3 23.68 3.31 No Dinner\n", "4 24.59 3.61 No Dinner" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips[['total_bill', 'tip', 'smoker', 'time']].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL where:\n", "```\n", "SELECT *\n", "FROM tips\n", "WHERE time = 'Dinner' AND tip > 5.00;\n", "LIMIT 5;\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
2339.427.58MaleNoSatDinner4
4430.405.60MaleNoSunDinner4
4732.406.00MaleNoSunDinner4
5234.815.20FemaleNoSunDinner4
5948.276.73MaleNoSatDinner4
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "23 39.42 7.58 Male No Sat Dinner 4\n", "44 30.40 5.60 Male No Sun Dinner 4\n", "47 32.40 6.00 Male No Sun Dinner 4\n", "52 34.81 5.20 Female No Sun Dinner 4\n", "59 48.27 6.73 Male No Sat Dinner 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Groupby\n", "\n", "Grouping and summarising data allows us to carry out the key data analysis steps of [split, apply, combine](https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf). The journal article by Hadley Wickham was one of the first formalisations of the split apply combine paradigm, and we can of course do it in Python.\n", "\n", "* Splitting the data into groups based on some criteria\n", "* Applying a function to each group independently\n", "* Combining the results into a data structure\n", "\n", "Let's continue on with our analysis of the tips data:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tipsgroups = tips.groupby('sex')\n", "tipsgroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a new data type, the groupby object.\n", "We can access the attribute, groups. 
This is a dict, with each level as its own entry and the indices of the original data frame:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Female\n", "Male\n" ] } ], "source": [ "for i,j in tipsgroups.groups.items():\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do grouping on any axis, or with a custom function (this example is pathological):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{0: ['tip', 'sex', 'day', 'time', 'size'], 1: ['total_bill', 'smoker']}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def myfun(index):\n", " if len(index) >= 5:\n", " return 1\n", " else:\n", " return 0\n", "\n", "group2 = tips.groupby(myfun, axis = 1)\n", "group2.groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use tab completion to see all our methods and attributes:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
sex
Female18.0568972.8334482.459770
Male20.7440763.0896182.630573
\n", "
" ], "text/plain": [ " total_bill tip size\n", "sex \n", "Female 18.056897 2.833448 2.459770\n", "Male 20.744076 3.089618 2.630573" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "tipsgroups.mean()\n", "#tipsgroups.boxplot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can iterate through groups:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Female\n", " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4\n", "11 35.26 5.00 Female No Sun Dinner 4\n", "14 14.83 3.02 Female No Sun Dinner 2\n", "16 10.33 1.67 Female No Sun Dinner 3\n", "Male\n", " total_bill tip sex smoker day time size\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "5 25.29 4.71 Male No Sun Dinner 4\n", "6 8.77 2.00 Male No Sun Dinner 2\n" ] } ], "source": [ "for name, group in tipsgroups:\n", " print(name)\n", " print(group.head(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To apply, we can use .aggregate:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "sex\n", "Female 2.833448\n", "Male 3.089618\n", "Name: tip, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tipsgroups.aggregate(np.mean)\n", "#selecting columns:\n", "tipsgroups['tip'].aggregate(np.mean)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
meansumstdmeansumstdmeansumstd
sex
Female18.0568971570.958.0092092.833448246.511.1594952.4597702140.937644
Male20.7440763256.829.2464693.089618485.071.4891022.6305734130.955997
\n", "
" ], "text/plain": [ " total_bill tip size \\\n", " mean sum std mean sum std mean \n", "sex \n", "Female 18.056897 1570.95 8.009209 2.833448 246.51 1.159495 2.459770 \n", "Male 20.744076 3256.82 9.246469 3.089618 485.07 1.489102 2.630573 \n", "\n", " \n", " sum std \n", "sex \n", "Female 214 0.937644 \n", "Male 413 0.955997 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#.agg is short for agg\n", "tipsgroups.agg([np.mean, np.sum, np.std])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tipsize
meansum<lambda>
sex
Female2.833448246.516
Male3.089618485.076
\n", "
" ], "text/plain": [ " tip size\n", " mean sum \n", "sex \n", "Female 2.833448 246.51 6\n", "Male 3.089618 485.07 6" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can also use a dict, to do different things to different rows:\n", "tipsgroups.agg({'tip': [np.mean, np.sum], 'size':lambda x: max(x)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also filter, transform, plot, count etc etc. Take a look in the help for more details!\n", "\n", "### Joins\n", "\n", "We can use a variety of joins in pandas, the most basic using the concat function:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']},\n", " index=[0, 1, 2, 3])\n", "df2 = DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", " 'B': ['B4', 'B5', 'B6', 'B7'],\n", " 'C': ['C4', 'C5', 'C6', 'C7'],\n", " 'D': ['D4', 'D5', 'D6', 'D7']},\n", " index=[4, 5, 6, 7])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
4A4B4C4D4
5A5B5C5D5
6A6B6C6D6
7A7B7C7D7
\n", "
" ], "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1\n", "2 A2 B2 C2 D2\n", "3 A3 B3 C3 D3\n", "4 A4 B4 C4 D4\n", "5 A5 B5 C5 D5\n", "6 A6 B6 C6 D6\n", "7 A7 B7 C7 D7" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#joins on index\n", "pd.concat([df1, df2])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDABCD
0A0B0C0D0NaNNaNNaNNaN
1A1B1C1D1NaNNaNNaNNaN
2A2B2C2D2NaNNaNNaNNaN
3A3B3C3D3NaNNaNNaNNaN
4NaNNaNNaNNaNA4B4C4D4
5NaNNaNNaNNaNA5B5C5D5
6NaNNaNNaNNaNA6B6C6D6
7NaNNaNNaNNaNA7B7C7D7
\n", "
" ], "text/plain": [ " A B C D A B C D\n", "0 A0 B0 C0 D0 NaN NaN NaN NaN\n", "1 A1 B1 C1 D1 NaN NaN NaN NaN\n", "2 A2 B2 C2 D2 NaN NaN NaN NaN\n", "3 A3 B3 C3 D3 NaN NaN NaN NaN\n", "4 NaN NaN NaN NaN A4 B4 C4 D4\n", "5 NaN NaN NaN NaN A5 B5 C5 D5\n", "6 NaN NaN NaN NaN A6 B6 C6 D6\n", "7 NaN NaN NaN NaN A7 B7 C7 D7" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#joins on index!\n", "pd.concat([df1, df2], axis = 1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1\n", "2 A2 B2 C2 D2\n", "3 A3 B3 C3 D3\n", "4 A4 B4 C4 D4\n", "5 A5 B5 C5 D5\n", "6 A6 B6 C6 D6\n", "7 A7 B7 C7 D7" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#ignore_index discards the original indices and renumbers the rows\n", "#note: DataFrame.append is deprecated in newer pandas; pd.concat([df1, df2], ignore_index=True) is equivalent\n", "df1.append(df2, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more control, we can use merge explicitly.\n", "\n", "It supports the standard SQL-style joins - inner, left, right, and outer (full):" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key value_x value_y\n", "0 A -2.275516 NaN\n", "1 B 0.050553 2.301355\n", "2 C 0.943035 NaN\n", "3 D -0.237517 0.449717\n", "4 D -0.237517 -1.692712\n", "5 E NaN -1.079463" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],\n", " 'value': np.random.randn(4)})\n", "df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],\n", " 'value': np.random.randn(4)})\n", "#note: only the last expression in the cell (the outer join) is displayed\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#INNER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key')\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#LEFT OUTER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key', how='left')\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#RIGHT OUTER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key', how='right')\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#FULL OUTER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key', how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have not yet covered several topics: [pivot tables and reshaping](http://pandas.pydata.org/pandas-docs/stable/reshaping.html), [window functions (substantially reworked in pandas 0.18.0)](http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#whatsnew-0180-enhancements-moments) and [time series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html). We will cover these as we continue through the course.\n", "\n", "### Example\n", "\n", "From here, we will work through an example of data analysis using pandas. In the first lesson, we looked at the tennis match-fixing scandal and briefly ran through the analysis. 
Now we have the skills and knowledge to walk through it and assess the analysis.\n", "\n", "Here are the links to the [original article](http://www.buzzfeed.com/heidiblake/the-tennis-racket) and the [notebook on GitHub](https://github.com/BuzzFeedNews/2016-01-tennis-betting-analysis).\n", "\n", "### Exercises\n", "\n", "* Update pandas using conda, and read the documentation for the new window functions.\n", "\n", "* Read through the tennis example and make sure you understand the basic idea of what is being done.\n", "\n", "* Rewrite cell 5 so that it does not use `~` for negation - you might need to google how to do this.\n", "\n", "* Write a function or statement to find how many of the report_players appear in your list of significant outliers (the output of cell 21).\n", "\n", "* Read the following CSV in, using io.StringIO. Note that there is no header row!\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = \"\"\"\\\n", "10/08/2012,12:10:10,name1,0.81,4.02,50;18.5701400N,4;07.7693770E,7.92,10.50,0.0106,4.30,0.0301\n", "10/08/2012,12:10:11,name2,-999,-999,-999,-999,10.87,1.40,0.0099,9.70,0.0686\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Read in the data as above, but with -999 treated as a missing value.\n", "\n", "* (advanced) From the tennis data frame, find the player (by hash) with the worst winning record, then the player with the best winning record (remember, there is a row for each betting agency for each match!)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }
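As a quick recap of the concat and merge cells above, here is a minimal self-contained sketch. The random values from the lesson are replaced with fixed integers so the row counts are easy to check; only the `key` columns mirror the lesson's frames.

```python
import pandas as pd

# illustrative frames mirroring the lesson's df1/df2 key columns
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [0, 1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': [4, 5, 6, 7]})

inner = pd.merge(df1, df2, on='key')                # keys in both frames: B, D, D -> 3 rows
left = pd.merge(df1, df2, on='key', how='left')     # every df1 key kept -> 5 rows
outer = pd.merge(df1, df2, on='key', how='outer')   # all keys, NaN where absent -> 6 rows
stacked = pd.concat([df1, df2], ignore_index=True)  # rows stacked, index renumbered -> 8 rows
```

Because both frames share a `value` column, merge suffixes the overlapping names as `value_x` and `value_y`, exactly as in the output above.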
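For the final two exercises, the relevant `read_csv` options are `header=None` (treat the first line as data, not column names) and `na_values` (extra strings to read as missing). A minimal sketch using the exercise data:

```python
import io
import pandas as pd

data = """\
10/08/2012,12:10:10,name1,0.81,4.02,50;18.5701400N,4;07.7693770E,7.92,10.50,0.0106,4.30,0.0301
10/08/2012,12:10:11,name2,-999,-999,-999,-999,10.87,1.40,0.0099,9.70,0.0686
"""

# header=None keeps the first row as data; na_values turns the -999 sentinels into NaN
df = pd.read_csv(io.StringIO(data), header=None, na_values=['-999'])
```

This is one way to approach it; the columns are simply numbered 0-11 since no names are supplied.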